Generating the training and test sets involves using the ocbio.extract module with the chosen gold-standard positive and negative datasets. This notebook acts as a script for doing so, with documentation inline.

First, the datasource table must be regenerated in the top-level directory containing the data:


In [1]:
cd ../../


/data/opencast/MRes

In [2]:
import csv

As the data repository has now been annexed, the datasource table must be unlocked before it can be rewritten:


In [3]:
!git annex unlock datasource.tab


unlock datasource.tab (copying...) ok

In [4]:
# This script should be updated to add new features when available.
# The with block closes the file even if a write fails.
with open("datasource.tab", "w") as f:
    c = csv.writer(f, delimiter="\t")
    # Gene Ontology features
    c.writerow(["Gene_Ontology", "Gene_Ontology", "generator=geneontology/testgen.pickle"])
    # Y2H SVM feature
    c.writerow(["Y2H/Y2H.txt", "Y2H/Y2H.db", "valindexes=(4);ignoreheader=1;zeromissing=1"])
    # ENTS feature
    c.writerow(["ENTS", "ENTS", "generator=ents/human.ENTS.features.pickle"])
    # ENTS summary feature
    c.writerow(["ENTS_summary", "ENTS_summary", "generator=ents/human.Entrez.ENTS.summary.pickle"])
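
To check the table before handing it to the assembler, it can be read back with the same csv module. The following is a minimal sketch and not part of ocbio: it writes two of the rows above to a temporary file so that it runs on its own, and `parse_options` is a hypothetical helper. It assumes the third column is a `;`-separated list of `key=value` pairs, which is how the rows above look but is not confirmed against the ocbio.extract source.

```python
import csv
import os
import tempfile

def parse_options(option_string):
    """Split 'a=1;b=2' into {'a': '1', 'b': '2'} (assumed format)."""
    options = {}
    for item in option_string.split(";"):
        if item:
            key, _, value = item.partition("=")
            options[key] = value
    return options

rows = [
    ["Gene_Ontology", "Gene_Ontology", "generator=geneontology/testgen.pickle"],
    ["Y2H/Y2H.txt", "Y2H/Y2H.db", "valindexes=(4);ignoreheader=1;zeromissing=1"],
]

# Write to a throwaway file so the real datasource.tab is untouched.
path = os.path.join(tempfile.mkdtemp(), "datasource.tab")
with open(path, "w") as f:
    csv.writer(f, delimiter="\t", lineterminator="\n").writerows(rows)

# Read the table back as (source, destination, options) triples.
with open(path) as f:
    table = [(src, dest, parse_options(opts))
             for src, dest, opts in csv.reader(f, delimiter="\t")]

for src, dest, opts in table:
    print("%s -> %s %r" % (src, dest, opts))
```

Option values are kept as strings here; how ocbio interprets a value such as `(4)` is left to the library.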

Importing ocbio.extract

Next, ocbio.extract must be added to the path and imported:


In [5]:
import sys

In [6]:
sys.path.append("opencast-bio/")

In [7]:
import ocbio.extract

In [8]:
reload(ocbio.extract)


Out[8]:
<module 'ocbio.extract' from 'opencast-bio/ocbio/extract.pyc'>
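
The bare `reload` call works because this notebook runs under Python 2, where `reload` is a builtin. A sketch of a version that also runs on Python 3, where the function lives in `importlib`, is shown below. The standard-library `json` module stands in for `ocbio.extract` so that the snippet runs without the opencast-bio checkout.

```python
try:
    from importlib import reload  # Python 3.4 and later
except ImportError:
    pass  # Python 2: reload is already a builtin

import json  # stand-in for ocbio.extract

# reload re-executes the module's code and returns the same module object
reloaded = reload(json)
print(reloaded is json)
```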

Unlocking databases

Now that the data directory has been annexed, the database files must also be unlocked before they can be modified:


In [9]:
!git annex unlock Y2H/Y2H.db


unlock Y2H/Y2H.db (copying...) ok

Initialising assembler

Then an assembler object must be initialised using the data source table:


In [10]:
assembler = ocbio.extract.FeatureVectorAssembler("datasource.tab", verbose=True)


Using  from top data directory datasource.tab.
Reading data source table:
	Data source: Gene_Ontology to be processed to Gene_Ontology
	Data source: Y2H/Y2H.txt to be processed to Y2H/Y2H.db
	Data source: ENTS to be processed to ENTS
	Data source: ENTS_summary to be processed to ENTS_summary
Initialising parsers.
Database Y2H/Y2H.db last updated 2014-06-25 12:15:04
Finished Initialisation.

Regenerating features

Then all the features should be regenerated to ensure they are up to date:


In [11]:
assembler.regenerate(verbose=True)


Regenerating parsers:
	 parser 0
Custom generator function, no database to regenerate.
	 parser 1
Database Y2H/Y2H.db last updated 2014-06-25 12:15:04
	 parser 2
Custom generator function, no database to regenerate.
	 parser 3
Custom generator function, no database to regenerate.

Generating training set

Using the set of positive interactions found through the iRefIndex project (created in this notebook), together with the matching negative set, we can create positive and negative feature vectors to train the classifier with:


In [12]:
assembler.assemble("iRefIndex/human.iRefIndex.positive.pairs.txt",
                   "features/human.iRefIndex.positive.vectors.txt",verbose=True)


Reading pairfile: iRefIndex/human.iRefIndex.positive.pairs.txt
Checking feature sizes:
	 Data source Gene_Ontology produces features of size 90.
	 Data source Y2H/Y2H.txt produces features of size 1.
	 Data source ENTS produces features of size 107.
	 Data source ENTS_summary produces features of size 1.
Writing feature vectors..................
Wrote 188833 vectors.
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from Y2H/Y2H.txt
Matched 38.39 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from ENTS
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from ENTS_summary

In [13]:
assembler.assemble("iRefIndex/human.iRefIndex.negative.pairs.txt",
                   "features/human.iRefIndex.negative.vectors.txt",verbose=True)


Reading pairfile: iRefIndex/human.iRefIndex.negative.pairs.txt
Checking feature sizes:
	 Data source Gene_Ontology produces features of size 90.
	 Data source Y2H/Y2H.txt produces features of size 1.
	 Data source ENTS produces features of size 107.
	 Data source ENTS_summary produces features of size 1.
Writing feature vectors...................................................................................................
Wrote 997760 vectors.
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from Y2H/Y2H.txt
Matched 29.69 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from ENTS
Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from ENTS_summary
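
The two runs above give a sense of the shape of the training data. As a quick check, the sketch below adds up the reported feature sizes to get the length of each assembled vector and computes the class balance from the reported vector counts. It assumes the assembler simply concatenates the per-source features, which the log suggests but does not state.

```python
# Feature sizes as reported under "Checking feature sizes"
feature_sizes = {
    "Gene_Ontology": 90,
    "Y2H/Y2H.txt": 1,
    "ENTS": 107,
    "ENTS_summary": 1,
}
# Assumption: each vector is the concatenation of all per-source features
vector_length = sum(feature_sizes.values())

# Vector counts as reported by the two assemble() calls
n_positive = 188833
n_negative = 997760
n_total = n_positive + n_negative

positive_fraction = float(n_positive) / n_total
negatives_per_positive = float(n_negative) / n_positive

print("vector length: %d" % vector_length)
print("positive fraction: %.3f" % positive_fraction)
print("negatives per positive: %.2f" % negatives_per_positive)
```

The negative set is roughly five times the size of the positive set.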

Generating active zone vectors

To apply our classifier to the interactions in the Active Zone network, we need feature vectors for those interactions. The protein pairs are listed in the edge list file passed to the assembler below:


In [14]:
assembler.assemble("forGAVIN/mergecode/OUT/edgelist.txt",
                   "features/human.activezone.txt",verbose=True)


Reading pairfile: forGAVIN/mergecode/OUT/edgelist.txt
Checking feature sizes:
	 Data source Gene_Ontology produces features of size 90.
	 Data source Y2H/Y2H.txt produces features of size 1.
	 Data source ENTS produces features of size 107.
	 Data source ENTS_summary produces features of size 1.
Writing feature vectors
Wrote 9375 vectors.
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Y2H/Y2H.txt
Matched 42.74 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS_summary
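
ENTS is the only source with incomplete coverage in all three runs. When comparing coverage across runs, it can help to pull the percentages out of the verbose output rather than read them by eye. A minimal sketch follows, assuming the log lines keep the `Matched ... % of protein pairs in ... to features from ...` wording shown above; `parse_coverage` is a hypothetical helper, not part of ocbio.

```python
import re

# Captures: percent matched, pair file, data source
MATCH_LINE = re.compile(
    r"Matched\s+([\d.]+)\s*% of protein pairs in (\S+) to features from (\S+)"
)

def parse_coverage(log_text):
    """Return {data source: percent matched} from assembler output."""
    coverage = {}
    for percent, _pairfile, source in MATCH_LINE.findall(log_text):
        coverage[source] = float(percent)
    return coverage

log = """\
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Gene_Ontology
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Y2H/Y2H.txt
Matched 42.74 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS
Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS_summary
"""

coverage = parse_coverage(log)
# Data sources that did not match every protein pair
incomplete = sorted(s for s, pct in coverage.items() if pct < 100.0)
print(incomplete)
```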